n-Gram/2L-approximation: a two-level n-gram inverted index structure for approximate string matching

نویسندگان

Min-Soo Kim

Kyu-Young Whang

Jae-Gil Lee

چکیده

Approximate string matching is to find all the occurrences of a query string in a text database allowing a specified number of errors. Approximate string matching based on the n-gram inverted index (simply, n-gram Matching) has been widely used. A major reason is that it is scalable for large databases since it is not a main memory algorithm. Nevertheless, n-gram Matching also has drawbacks: the query performance tends to be bad, and many false positives occur if a large number of errors are allowed. In this paper, we propose an inverted index structure, which we call the n-gram/2L-Approximation index, that improves these drawbacks and an approximate string matching algorithm based on it. The n-gram/2L-Approximation is an adaptation of the n-gram/2L index [4], which the authors have proposed earlier for exact matching. Inheriting the advantages of the n-gram/2L index, the n-gram/2L-Approximation index reduces the size of the index and improves the query performance compared with the n-gram inverted index. In addition, the n-gram/2L-Approximation index reduces false positives compared with the n-gram inverted index if a large number of errors are allowed. We perform extensive experiments using the text and protein databases. Experimental results using databases of 1GBytes show that the n-gram/2L-Approximation index reduces the index size by up to 1.8 times and, at the same time, improves the query performance by up to 4.2 times compared with those of the n-gram inverted index.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure

The n-gram inverted index has two major advantages: language-neutral and error-tolerant. Due to these advantages, it has been widely used in information retrieval or in similar sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: the size tends to be very large, and the performance of queries tends to be bad. In this paper, we propose the ...

متن کامل

Assisting bug Triage in Large Open Source Projects Using Approximate String Matching

In this paper, we propose a novel approach for assisting human bug triagers in large open source software projects by semi-automating the bug assignment process. Our approach employs a simple and efficient n-gram-based algorithm for approximate string matching on the character level. We propose and implement a recommender prototype which collects the natural language textual information availab...

متن کامل

Nikolaus Augsten Approximate Matching of Hierarchical Data

The goal of this thesis is to design, develop, and evaluate new methods for the approximate matching of hierarchical data represented as labeled trees. In approximate matching scenarios two items should be matched if they are similar. Computing the similarity between labeled trees is hard as in addition to the data values also the structure must be considered. A well-known measure for comparing...

متن کامل

An edit operation-based approach to approximate string matching in large DNA databases

In DNA related research, due to various environment conditions, mutations occur very often, where a mutation is defined as a heritable change in the DNA sequence. Therefore, approximate string matching is applied to answer those queries which find mutations. The problem of approximate string matching is that given a user specified parameter, k, we want to find where the substrings, which could ...

متن کامل

s-grams: Defining generalized n-grams for information retrieval

n-grams have been used widely and successfully for approximate string matching in many areas. s-grams have been introduced recently as an n-gram based matching technique, where di-grams are formed of both adjacent and non-adjacent characters. s-grams have proved successful in approximate string matching across language boundaries in Information Retrieval (IR). s-grams however lack precise defin...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

Comput. Syst. Sci. Eng.

دوره 22 شماره

صفحات -

تاریخ انتشار 2007

n-Gram/2L-approximation: a two-level n-gram inverted index structure for approximate string matching

نویسندگان

چکیده

منابع مشابه

n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure

Assisting bug Triage in Large Open Source Projects Using Approximate String Matching

Nikolaus Augsten Approximate Matching of Hierarchical Data

An edit operation-based approach to approximate string matching in large DNA databases

s-grams: Defining generalized n-grams for information retrieval

عنوان ژورنال:

اشتراک گذاری